$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\cset}[1]{\mathcal{#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} \newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]} \newcommand{\ip}[3]{\left<#1,#2\right>_{#3}} \newcommand{\given}[]{\,\middle\vert\,} \newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)} \newcommand{\grad}[]{\nabla} $$

Part 1: Mini-Project¶

In this part you'll implement a small comparative-analysis project, heavily based on the materials from the tutorials and homework.

Guidelines¶

  • You should implement the code which displays your results in this notebook, and add any additional code files for your implementation in the project/ directory. You can import these files here, as we do for the homeworks.
  • Running this notebook should not perform any training - load your results from some output files and display them here. The notebook must be runnable from start to end without errors.
  • You must include a detailed write-up (in the notebook) of what you implemented and how.
  • Explain the structure of your code and how to run it to reproduce your results.
  • Explicitly state any external code you used, including built-in PyTorch models and code from the course tutorials/homework.
  • Analyze your numerical results, explaining why you got these results (not just specifying the results).
  • Where relevant, place all results in a table or display them using a graph.
  • Before submitting, make sure all files which are required to run this notebook are included in the generated submission zip.
  • Try to keep the submission file size under 10MB. Do not include model checkpoint files, dataset files, or any other non-essential files. Instead, include your results as images/text files/pickles/etc., and load them for display in this notebook.

Object detection on TACO dataset¶

You can read more about the dataset here: https://github.com/pedropro/TACO

You can explore the data distribution and see how to load it here: https://github.com/pedropro/TACO/blob/master/demo.ipynb

The stable version of the dataset, which contains 1500 images and 4787 annotations, is located in datasets/TACO-master. You do not need to download the dataset.

Project goals:¶

  • You need to perform an object-detection task over the 7 categories of the dataset.
  • The annotations for object detection can be downloaded from here: https://github.com/wimlds-trojmiasto/detect-waste/tree/main/annotations.
  • The data and annotation format follows the COCO API: https://github.com/cocodataset/cocoapi (you can find a notebook showing how to perform evaluation with it here: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocoEvalDemo.ipynb)

(You need to install it.)

  • If you need a beginner guide for object detection with the COCO API, you can read and watch this link: https://www.neuralception.com/cocodatasetapi/
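For reference, the COCO detection format referred to above boils down to a JSON file with three top-level lists. A minimal, hypothetical example (the ids, file name, and category here are made up for illustration):

```python
import json

# Minimal COCO-format detection file: three top-level lists.
# Boxes are [x, y, width, height] in absolute pixels.
coco = {
    "images": [
        {"id": 1, "file_name": "batch_1/000001.jpg",
         "width": 640, "height": 480},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100.0, 50.0, 200.0, 120.0],   # [x, y, w, h]
         "area": 200.0 * 120.0, "iscrowd": 0, "segmentation": []},
    ],
    "categories": [
        {"id": 1, "name": "metals_and_plastic", "supercategory": "litter"},
    ],
}

print(sorted(coco))  # ['annotations', 'categories', 'images']
```

The `pycocotools` COCO class loads exactly this structure, which is why missing keys (such as `segmentation`) can break downstream tooling.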

What do I need to do?¶

  • Everything is in the game! As long as your model does not require more than 8 GB of memory and you follow the Guidelines above.

What does it mean?¶

  • You can use data augmentation - either take what's implemented in the directory or use external libraries such as https://albumentations.ai/ (notice that when you create your own augmentations you need to update the annotations as well)
  • You can use more data if you find it useful (for example, review https://github.com/AgaMiko/waste-datasets-review)
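As a concrete illustration of why the annotations must change together with the augmentation, here is a sketch of a horizontal flip for COCO-style boxes (`hflip_with_boxes` is a hypothetical helper, not part of any library):

```python
def hflip_with_boxes(img_width, boxes):
    """Horizontally flip COCO-style [x, y, w, h] boxes for an image of
    the given width. Only the x origin moves: new_x = W - x - w;
    y and the box size are unchanged."""
    return [[img_width - x - w, y, w, h] for (x, y, w, h) in boxes]

# a 100px-wide image with one box starting at x=10 and 30px wide:
flipped = hflip_with_boxes(100, [[10, 20, 30, 40]])
print(flipped)  # [[60, 20, 30, 40]]
```

Flipping twice returns the original box, which is a handy sanity check for any geometric augmentation you write yourself.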

What model can I use?¶

  • Whatever you want!

You can review good models for the COCO OD task as a reference: SOTA: https://paperswithcode.com/sota/object-detection-on-coco Real-Time: https://paperswithcode.com/sota/real-time-object-detection-on-coco Or you can use older models like YOLOv3 or Faster R-CNN.

  • As long as you have a reason (complexity, speed, performance), you are golden.

Tips for a good grade:¶

  • Start as simple as possible. Dealing with APIs is not the easiest thing the first time, and I predict this will be your main issue. Only once you have a running model that learns should you add learning tricks.
  • Use the notebook's visualizations, as we did throughout the course: check that your input actually fits the model, that the output is the desired size, and so on.
  • It is recommended to resize the images to a fixed size, as shown here: https://github.com/pedropro/TACO/blob/master/detector/inspect_data.ipynb
  • Please address the architecture and your loss function(s) in this notebook. If you decided to add a loss component, such as focal loss, try to show the results before and after using it.
  • Plot your losses in this notebook; any evaluation metric can be shown as a function of time and can also be analyzed per class.

Good luck!

Implementation¶

Goal¶

In this project we fine-tuned and compared the performance of 3 well-known object-detection models - DETR, YOLO, and Faster R-CNN. Each model has a different architecture and represents a different learning strategy. Our goal is to compare their performance on the TACO dataset and derive insights about each one separately. Below we present this process.

In [1]:
%load_ext autoreload
%autoreload 2
import sys
import json
sys.path.append('/home/ilay.kamai/mini_project/detr')
from PIL import Image as pil_image

Part 1 - Data Exploration¶

First, we explored the data - images and bounding boxes. We did it in two ways: manually, and using exploration functions. The manual inspection gave a lot of insights and in particular revealed 2 problems with the annotations:

  1. there were annotations that point to non-existent images
  2. there were annotations with a missing 'segmentation' key. Although we don't need this key for the task, it is necessary for compatibility with some APIs

We fixed these problems using the methods "filter_anns" and "fill_anns" in the project/utils.py file, and we also split the train annotations into train and validation sets. The file project/utils.py contains the data-exploration methods. After that, we reformatted the dataset to fit the APIs we worked with: project/modify_dataset.py creates a data structure that fits the COCO format, which we used for DETR and Faster R-CNN fine-tuning, and project/yolo/create_taco_data.py creates the file structure for the YOLO model.
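A minimal sketch of those two fixes on a toy dictionary (the real implementations are filter_anns and fill_anns in project/utils.py; `clean_annotations` and the toy data here are hypothetical):

```python
def clean_annotations(coco_dict):
    """Drop annotations whose image_id has no matching image, and add an
    empty 'segmentation' key where it is missing."""
    valid_ids = {img["id"] for img in coco_dict["images"]}
    cleaned = []
    for ann in coco_dict["annotations"]:
        if ann["image_id"] not in valid_ids:
            continue  # annotation points to a non-existent image
        ann.setdefault("segmentation", [])  # required by some COCO APIs
        cleaned.append(ann)
    return {**coco_dict, "annotations": cleaned}

toy = {"images": [{"id": 1}],
       "annotations": [{"id": 1, "image_id": 1},
                       {"id": 2, "image_id": 99}]}
print(len(clean_annotations(toy)["annotations"]))  # 1
```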


Below is an example of some minimal data exploration:
In [2]:
from project.utils import anns_hist
for s in ['train', 'val', 'test']:
    anns_hist(f"annotations_{s}.json")
Number of images: 946
Number of images with no annotations: 0
annotations DataFrame:
            Categories  Number of annotations  inv_density   weights
0  metals_and_plastic                   1520     1.986193  0.000997
2      non_recyclable                    848     3.558304  0.001785
6             unknown                    306     9.840391  0.004937
3               glass                    185    16.241935  0.008149
4               paper                    155    19.365385  0.009716
5                 bio                      6   431.571429  0.216537
1               other                      1  1510.500000  0.757878
Number of images: 237
Number of images with no annotations: 0
annotations DataFrame:
            Categories  Number of annotations  inv_density   weights
0  metals_and_plastic                    408     1.970660  0.001742
2      non_recyclable                    241     3.330579  0.002944
6             unknown                     71    11.194444  0.009895
4               paper                     53    14.925926  0.013194
3               glass                     31    25.187500  0.022265
5                 bio                      2   268.666667  0.237490
1               other                      0   806.000000  0.712470
Number of images: 317
Number of images with no annotations: 0
annotations DataFrame:
            Categories  Number of annotations  inv_density   weights
0  metals_and_plastic                   1179     2.356780  0.007526
6             unknown                    741     3.747978  0.011969
2      non_recyclable                    602     4.611940  0.014727
4               paper                    118    23.369748  0.074627
3               glass                     96    28.670103  0.091553
1               other                     28    95.896552  0.306229
5                 bio                     17   154.500000  0.493369

Part 2 - Finetuning models¶

After exploring the data, we moved on to training the first model.


The first model we used is **DETR** (DEtection TRansformer), which was first proposed in https://arxiv.org/pdf/2005.12872.pdf.
DETR consists of a convolutional backbone (ResNet-50) and a transformer encoder-decoder. The output of the decoder is a set of proposal bounding-box predictions and class logits. The number of proposal boxes is fixed (it equals the number of queries fed to the decoder). Since we wanted to use a pretrained model, we used the same number as in the pretrained model (which was trained on COCO) - 100. The proposal boxes are matched to the ground-truth boxes using the Hungarian algorithm, which finds the best match between the proposal boxes and the ground-truth boxes given a distance metric (in our case a linear combination of NLL, L1, and generalized IoU). All boxes that were not matched to a ground-truth box are considered "background", which forced us to include a "background" class. In addition, we noted that the label index of the first class is 1 (and not zero), so we added a "dummy" label called "N/A" with label 0.
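The matching step can be illustrated with a tiny brute-force sketch (the official DETR code solves this with scipy's linear_sum_assignment; the cost values below are made up):

```python
from itertools import permutations

def best_match(cost):
    """Find the assignment of predictions (rows) to ground-truth boxes
    (columns) minimising the total matching cost. DETR uses the Hungarian
    algorithm; brute force over permutations is enough to illustrate the
    idea for tiny square cost matrices."""
    n = len(cost)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm

# cost[i][j] = matching cost of prediction i against ground truth j
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.7, 0.3]]
total, perm = best_match(cost)
print(perm)  # (0, 1, 2): each prediction matched to its cheapest GT
```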
The architecture of the model is shown below:
In [3]:
pil_image.open('imgs/detr_architecture.png')
Out[3]:

To fine-tune DETR, we used this fork of the original DETR repo: https://github.com/woctezuma/detr/tree/finetune/models. We modified this repo with the following main improvements:


1. we added a focal-loss class (in project/detr/models/detr.py) that can replace cross entropy for label classification
2. we added class weights to the cross-entropy loss (in project/detr/models/detr.py)
3. we added the option to freeze the backbone
4. we added new prediction functions and plots (all in project/detr/detr_predict.py)

We used the dataset provided in the repo with only slight modifications (image reorientation, as shown in https://github.com/pedropro/TACO/blob/master/demo.ipynb).
In the dataset code (project/detr/datasets/coco.py) we prepare the labels and bounding boxes, clamp the boxes to the image size, and filter out illegal boxes. We experimented with different augmentations: the transforms in the repo were implemented from scratch and proved better on the validation set and more effective than Albumentations transforms. We used the same augmentations as the repo (applied to both the image and the annotations): horizontal flip and a mixture of resize and crop for training, and only resize for validation and test. We also normalized the images with ImageNet normalization and normalized the bounding boxes (in cxcywh format).
We tested fine-tuning the entire model vs. fine-tuning only the transformer and classification head (freezing the backbone), and found that both give very similar results, with fine-tuning the entire model slightly better than the frozen-backbone case. This makes sense because the backbone's task is to process information at the feature level, where the difference between the TACO and COCO datasets may be small but still exists. We therefore chose to train the entire model (41M parameters).

We also compared focal loss against cross-entropy loss with class weights. To find the best hyperparameters we used the Optuna package to search for an optimal combination (the code can be found in project/detr/opt.py). Unfortunately, due to memory issues, we could only run the optimization process with batch_size=1, which was not very effective, so we did most of the hyperparameter tuning manually. We tried using the training-set inverse class frequencies as weights but eventually found that a "harder" weighting (where all classes except the 2 most frequent have a weight of 1) gave better results on the validation set.
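To make the contrast between the two losses concrete, here is a toy per-example sketch (plain Python; the actual implementation is the focal-loss class in project/detr/models/detr.py):

```python
import math

def weighted_ce(p_true, weight=1.0):
    """Weighted cross entropy for the probability the model assigns to
    the true class."""
    return -weight * math.log(p_true)

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss (Lin et al. 2017): down-weights well-classified
    examples by (1 - p)^gamma, so hard/rare examples dominate."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# an easy example (p=0.9) is damped ~100x by gamma=2,
# while a hard one (p=0.1) keeps most of its loss:
for p in (0.9, 0.1):
    print(f"p={p}: ce={weighted_ce(p):.3f}  focal={focal_loss(p):.3f}")
```

With gamma=2, the damping factor is exactly (1 - p)^2, which is why focal loss shifts the optimization effort toward examples the model gets wrong.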


Below we compare the results of training with focal loss vs. training with weighted cross entropy.
The training is done in detr/main.py, which calls the train_epoch and evaluation methods in detr/engine.
To train the model and make predictions you can run the following code (note that you need to change the dataset and checkpoint paths in detr.detr_finetune.py):
In [4]:
# import project.detr.detr_finetune as detr_finetune
# detr_finetune.finetune()

DETR results -

First, we look at the training-process graphs with focal loss:

In [5]:
im1 = pil_image.open('imgs/detr/mAP_loss_focal.png')
im2 = pil_image.open('imgs/detr/losses_focal.png')
im1.show()
im2.show()

Now we look at the training with weighted cross entropy:

In [6]:
# from IPython.display import display
# from IPython.display import Image


im1 = pil_image.open('imgs/detr/mAP_loss_full.png')
im2 = pil_image.open('imgs/detr/losses_full.png')
im1.show()
im2.show()

Note that to compare them we need to look at the accuracy metric, mAP, and not at the loss values (as those are different functions). The mAP metric in the above graphs is mAP25-75 - the average over the IoU thresholds [0.25, 0.50, 0.75]. To calculate the mAP we used the torchmetrics package (we also implemented it from scratch in project/detr/detr_predict.py; we got the same results and found the package very easy to use, so we used it). We see that the focal-loss mAP is much lower than that of weighted cross entropy. Another interesting observation comes from the individual loss terms. When using focal loss, the limiting term (the one that overfits) is the GIoU (generalized intersection over union), while with cross entropy it is the cross entropy itself (when using focal loss, loss_ce refers to the focal loss). This implies that focal loss indeed improves the label classification, but at the expense of the bounding-box IoU. The fact that weighted cross entropy beat focal loss can be explained by the data not being highly imbalanced: the improvement the focal term brings to the rare classes comes at the expense of accuracy on the frequent classes. It might also be a result of non-optimal parameters when using focal loss.
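For completeness, here is a small standalone sketch of the generalized-IoU term discussed above (a hypothetical single-pair function; DETR computes it batched over all box pairs):

```python
def giou(a, b):
    """Generalized IoU (Rezatofighi et al. 2019) for two [x1, y1, x2, y2]
    boxes: IoU minus the fraction of the smallest enclosing box not
    covered by the union. Ranges in (-1, 1]; unlike plain IoU it gives a
    useful signal even for non-overlapping boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # smallest box enclosing both
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou - (c_area - union) / c_area

print(giou([0, 0, 2, 2], [0, 0, 2, 2]))  # 1.0 (perfect overlap)
print(giou([0, 0, 1, 1], [2, 2, 3, 3]))  # negative: disjoint boxes
```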


We also see that the training with cross entropy stops earlier than with focal loss. We think that with more rigorous hyperparameter tuning this might be improved (with manual tuning we did not manage to get better results).

Next, we trained the model on the entire training set for the same number of epochs and evaluated the accuracy on the test set. The results of this final training run are shown below:

In [7]:
im1 = pil_image.open('imgs/detr/test_loss.png')
im2 = pil_image.open('imgs/detr/test_losses.png')
im1.show()
im2.show()

Now let's examine the predictions, first in a more visual way. Below are sample predictions, together with the prediction probability, on top of the real boxes and labels:

In [8]:
for p in ['0', '30', '60', '90']:
    im = pil_image.open('imgs/detr/preds_{}.png'.format(p))
    im.show()

We can see that the predictions are overall really good. The bounding boxes are almost perfectly aligned, and the labels are usually correct (but not always).


It is also interesting to look at the attention weights. The following graphs show the decoder attention weights (reshaped to the size of the last feature map) of all the heads of the last layer, for the same samples as above (the blue rectangle represents the true bounding box):
In [9]:
for a in ['0', '30', '60', '90']:
    print(f'sample_{a}')
    for b in ['0', '1']:
        im = pil_image.open('imgs/detr/attn_{}_{}_attn.png'.format(a,b))
        im.show()
sample_0
sample_30
sample_60
sample_90

Looking at the attention weights, it is clear that there is a connection between the bounding box and the pixels with high attention, meaning the model learned to focus on the desired objects. Nevertheless, there are examples where the attention weights are high for parts of the image that are not of interest. This can be understood as a remnant of the pretraining phase, where the model was trained on a dataset with many more classes.


The code for plotting the samples and the attention weights can be found in project/detr/util/plot_utils.py.
Lastly, we would like to look at the class distribution of the model predictions vs. the real distribution:
In [10]:
pil_image.open('imgs/detr/cls_dist.jpeg')
Out[10]:

We see that, overall, the predicted distributions fit the real ones. Nevertheless, there is over-prediction of the third class and under-prediction of the sixth class, roughly in the same amount. The explanation is the difference in the real class distributions between the train and test sets: we saw at the beginning of the notebook that the frequencies of these classes ('unknown' and 'non_recyclable') differ. Since we trained on the train distribution and tested on the test distribution, we expect this shift to be reflected in the results, and this is exactly what we observe. (Note that at the beginning of the notebook all annotations are shown, while here only the legal ones are; still, we expect to see the effect, at least qualitatively.)


The mAP results of the test set are the following:
In [11]:
detr_acc_path = 'imgs/detr/test_acc.json'
with open(detr_acc_path, 'r') as json_file:
    data = json.load(json_file)

print(json.dumps(data, indent=4))
{
    "map": 0.28570878505706787,
    "map_50": 0.3183254599571228,
    "map_75": 0.16300788521766663,
    "map_small": 0.058816004544496536,
    "map_medium": 0.3416192829608917,
    "map_large": 0.5460426807403564,
    "mar_1": 0.2738052010536194,
    "mar_10": 0.35033294558525085,
    "mar_100": 0.35442739725112915,
    "mar_small": 0.11222628504037857,
    "mar_medium": 0.403219074010849,
    "mar_large": 0.5912919640541077,
    "map_per_class": [
        0.48206546902656555,
        0.0,
        0.31224218010902405,
        0.25925925374031067,
        0.6606858372688293,
        0.0,
        -1.0
    ],
    "mar_100_per_class": [
        0.5789473652839661,
        0.0,
        0.47455471754074097,
        0.3541666567325592,
        0.6879432797431946,
        0.03095238097012043,
        -1.0
    ],
    "classes": [
        1,
        2,
        3,
        4,
        5,
        7,
        8
    ]
}

Specifically, we see an mAP50 of 0.32 and an mAP25-75 of 0.29.


Diving deeper into the mAP, we see that the model's performance on small objects (mAP_small) is by far its limiting factor (0.058 vs. 0.34 and 0.55 for medium and large). This might be a result of the spatial resolution of the images. Using hierarchical learning, as in Swin transformers for example, might also help in this aspect.

Next, we trained a YOLOv8 model. YOLO (You Only Look Once) is a CNN-based architecture that aims to detect and classify objects in images with real-time efficiency. YOLO divides the image into a grid and predicts bounding boxes and class probabilities directly within each grid cell.


The architecture is the following:
In [12]:
im = pil_image.open('imgs/yolo_architecture.jpg')
im.show()

To train YOLO we used the built-in Ultralytics train function without modifying the source code at all; instead, we used the high-level API, which allows controlling various training parameters by passing keyword arguments to the model.train method. Nevertheless, one change we did need to make in order to comply with the Ultralytics API was to change the directory structure and the annotation format (from COCO to YOLO format). This was done with the code in project/yolo/create_taco_data.py.
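The core of that conversion is the box-format change; a minimal sketch of it (the real code is in project/yolo/create_taco_data.py). In YOLO format, each image gets a .txt file with one "class cx cy w h" line per object, with coordinates normalized by the image size:

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO [x, y, w, h] box (absolute pixels, top-left origin)
    to YOLO format: [cx, cy, w, h], center-based and normalized to [0, 1]
    by the image width/height."""
    x, y, w, h = bbox
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]

print(coco_to_yolo([100, 50, 200, 100], img_w=400, img_h=200))
# [0.5, 0.5, 0.5, 0.5]
```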


We tested different model sizes - nano, small, and large. The performance of the large model was, unsurprisingly, the best. Using the nano model, we compared fine-tuning only the head vs. training the entire model and found that, similar to what we observed with DETR, training the entire model was better. The fact that both YOLO and DETR performed better when training the entire model is strong evidence for the following: even though the low-level features of the COCO and TACO tasks are very similar, some differences remain, and feature learning is still needed. It is possible, for example, that most everyday items are recognized by the model through the geometric shapes of their contours, while for waste items the geometric shape is less important and the texture matters a lot. It is also possible that, on average, waste objects are smaller than general objects.
To run the YOLO model you can run the following cell (but first you need to make sure the data is organized properly and to update project/yolo/datasets.yaml):
In [13]:
# import project.yolo.yolo_finetune as yolo_finetune
# yolo_finetune.finetune()

Below is a comparison between the different model sizes:

In [14]:
nano = pil_image.open('imgs/yolo/results_nano.png')
small = pil_image.open('imgs/yolo/results_small.png')
large = pil_image.open('imgs/yolo/results_large.png')
print('nano:')
nano.show()
print("small:")
small.show()
print("large:")
large.show()
nano:
small:
large:

It can be seen that the difference in performance between small and large is ~1% mAP, while the difference in the number of parameters is ~20M. This is not an ideal trade-off.


We used the default dataset transformations (translation, scaling, and horizontal flip) and experimented with different image sizes, finding no difference between sizes in the range 864-1200. We also experimented with different optimizers and with label-smoothing cross entropy (a variation of cross entropy where all labels get some small probability mass, not only the correct one). Again, a more extensive hyperparameter-optimization process would probably yield slightly better results.
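A toy sketch of the label-smoothing idea on a one-hot target (`smooth_labels` is a hypothetical helper for illustration; in practice the smoothing is applied inside the loss via a training hyperparameter):

```python
def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: replace a one-hot target with (1 - eps) on the
    true class plus eps / K spread uniformly over all K classes, so the
    loss never pushes probabilities all the way to 0 or 1."""
    k = len(one_hot)
    return [(1.0 - eps) * t + eps / k for t in one_hot]

# the true class keeps most of the mass; the rest is spread uniformly:
print(smooth_labels([0, 0, 1, 0], eps=0.1))
```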

Precision and recall on the validation set can be seen below:

In [15]:
pil_image.open('imgs/yolo/PR_curve.png').show()
pil_image.open('imgs/yolo/P_curve.png').show()
pil_image.open('imgs/yolo/R_curve.png').show()
# display(pr, p, r)

The Recall-Confidence and Precision-Confidence graphs show an interesting phenomenon: interpreting the two together, we see that a higher confidence threshold increases the precision but lowers the recall. In other words, as the confidence threshold increases, the model "narrows" and "refines" its predictions (small recall = "narrowing", high precision = "refining"). When the recall reaches 0 the process stops, and this is where we see the straight line in the precision curve (constant precision). We also see that the decrease in recall is much more drastic for the rare labels. This is a result of the class imbalance: since there are fewer samples for rare labels, the model fails to recognize them and focuses on the more frequent ones.
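The "narrowing vs. refining" effect can be reproduced with a toy computation (hypothetical helper and made-up detections; real PR curves are computed per class over IoU-matched detections):

```python
def precision_recall_at(preds, threshold):
    """preds: list of (confidence, is_true_positive) detections.
    Raising the confidence threshold keeps fewer detections: recall can
    only drop, while precision typically rises."""
    kept = [tp for conf, tp in preds if conf >= threshold]
    n_gt = sum(tp for _, tp in preds)  # toy: every TP corresponds to a GT hit
    if not kept:
        return 1.0, 0.0  # convention for an empty prediction set
    precision = sum(kept) / len(kept)
    recall = sum(kept) / n_gt
    return precision, recall

preds = [(0.9, 1), (0.8, 1), (0.6, 0), (0.4, 1), (0.2, 0)]
for t in (0.1, 0.5, 0.95):
    p, r = precision_recall_at(preds, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```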


Next, we tested the model. Below are sample predictions on the test set:
In [16]:
for b in ['0', '1', '2']:
    for p in ['labels', 'pred']:
        im = pil_image.open('imgs/yolo/val_batch{}_{}.jpg'.format(b,p))
        print(p, b)
        im.show()
labels 0
pred 0
labels 1
pred 1
labels 2
pred 2

We can see that overall the predictions are not great - there are missing bounding boxes and incorrect classes.


It is also interesting to look at the confusion matrix of the predictions:
In [17]:
pil_image.open('imgs/yolo/confusion_matrix_normalized.png').show()

The confusion matrix above shows that the model was able to learn only 2 classes (background means no class).


The results on the test set are the following:
In [18]:
pil_image.open('imgs/yolo/PR_curve_test.png').show()
pil_image.open('imgs/yolo/P_curve_test.png').show()
pil_image.open('imgs/yolo/R_curve_test.png').show()

On the test set we got an mAP50 of 0.0934 and an mAP50-95 of 0.0662.

From the above plots and images we can clearly see the difference between classes: the best-represented class has the best performance, and the model over-predicts it even in cases where the real label is different. This is expected under class imbalance, and we expect that modifying the loss to mitigate it (with class weights or focal loss) would improve performance. We didn't do so because we found the Ultralytics API very complex to modify (as opposed to training the model as-is) and didn't have the time to invest in this route.


Another observation is that the performance of YOLO was much worse than that of DETR.

Lastly, we moved on to training a Faster R-CNN model. Faster R-CNN is a convolutional model with 2 stages: a region-proposal network (RPN) and a Fast R-CNN head. The two parts have different roles - the RPN creates bounding-box proposals, and the R-CNN head uses those proposals (in feature-map space) to predict bounding boxes and classes.


The architecture of the model is the following:
In [19]:
im = pil_image.open('imgs/rcnn_architecture.jpg')
im.show()

We used the PyTorch model (torchvision.models.detection) with pretrained weights.


Following the previous experiments, we decided to train the entire model rather than freezing the backbone.
To add flexibility, we made the following modifications:
1. We added a custom loss function by overriding the original loss function, torchvision.models.detection.roi_heads.fastrcnn_loss, with a new one (in project/rcnn/util/misc.py). We created a focal-loss class (similar to what we did for DETR) and cross entropy with label smoothing. Although this hack works, it is not ideal: the loss function is called inside the forward method of the model (during training), so we cannot change its arguments; we therefore gave the focal loss default arguments and did the hyperparameter optimization manually.
2. We created a custom evaluation method (in project/rcnn/train_rcnn.py) that returns the loss dictionary and the predictions.
3. We changed the anchor sizes in the RPN: we created a custom AnchorGenerator and passed it to the RPN. This is important since the objects in TACO are not necessarily the same size as the objects in COCO (on which the model was pretrained).
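To illustrate what tuning the anchor sizes changes, here is a sketch of how base anchors are derived from sizes and aspect ratios (a simplified, hypothetical version of what torchvision's AnchorGenerator computes before tiling the anchors over the feature-map grid):

```python
import math

def base_anchors(sizes, aspect_ratios):
    """Generate zero-centered base anchors [x1, y1, x2, y2], one per
    (size, ratio) pair. For a target area of size**2 and ratio h/w:
    h = size * sqrt(ratio), w = size / sqrt(ratio)."""
    anchors = []
    for size in sizes:
        for ratio in aspect_ratios:
            h = size * math.sqrt(ratio)
            w = size / math.sqrt(ratio)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return anchors

# smaller sizes than the COCO defaults, to better match TACO's objects:
for a in base_anchors(sizes=(16, 32), aspect_ratios=(0.5, 1.0, 2.0)):
    print([round(v, 1) for v in a])
```

Choosing smaller `sizes` makes the RPN propose boxes closer to the scale of small litter objects, which is why this knob matters for TACO.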

To train the model we used a modified version of the Trainer class from the homework (in project/rcnn/train_rcnn.py). We used the same dataset and augmentations and followed the same learning tricks as in DETR (early stopping, weight decay, etc.).

To run the Faster R-CNN model, run the following cell (again, you need to modify the paths to fit your setup):

In [20]:
# import project.rcnn.fasterrcnn as fasterrcnn
# fasterrcnn.finetune()

Below are the training graphs comparing the results of focal loss and smoothed cross entropy on the training and validation sets:

In [21]:
pil_image.open('imgs/rcnn/mAP_loss_all.png').show()

We see that focal loss gives better results than cross entropy. We therefore used focal loss, trained the model on the entire training set, and evaluated it on the test set.


The results are as follows:
In [22]:
pil_image.open('imgs/rcnn/test_loss.png').show()

Below are sample predictions on the test set:

In [23]:
for p in ['0', '30', '60', '90']:
    im = pil_image.open('imgs/rcnn/preds_{}.png'.format(p))
    im.show()

Looking at the sample predictions, we see that the bounding boxes are usually good. This is probably due to the change in anchor sizes, which enables the model to process small objects, and to the NMS (non-maximum suppression) threshold that removes overlaps. However, the label predictions are effectively a majority-class classifier: the model almost always predicts the most frequent class. We expected focal loss to mitigate this but did not get good results in that regard. This is interesting in itself, since for DETR we found that cross entropy outperformed focal loss, and here we see the opposite. It might be related to the ability of transformers to distinguish between classes using attention (giving different attention weights to different classes), as we saw in class. In any case, more in-depth hyperparameter tuning might improve this.

The class distributions and mAP results on the test set are:

In [24]:
pil_image.open('imgs/rcnn/cls_dist.jpeg').show()

rcnn_acc_path = 'imgs/rcnn/test_acc.json'
with open(rcnn_acc_path, 'r') as json_file:
    data = json.load(json_file)

print(json.dumps(data, indent=4))
{
    "map": 0.08888889104127884,
    "map_50": 0.08897058665752411,
    "map_75": 0.0887254923582077,
    "map_small": 0.0361669659614563,
    "map_medium": 0.1666666716337204,
    "map_large": 0.20000000298023224,
    "mar_1": 0.06625016778707504,
    "mar_10": 0.134905144572258,
    "mar_100": 0.14788758754730225,
    "mar_small": 0.1472356617450714,
    "mar_medium": 0.16577060520648956,
    "mar_large": 0.1944444477558136,
    "map_per_class": [
        0.5333333611488342,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0
    ],
    "mar_100_per_class": [
        0.8835088014602661,
        0.0,
        0.003816793905571103,
        0.0,
        0.0,
        0.0
    ],
    "classes": [
        1,
        2,
        3,
        4,
        5,
        7
    ]
}

Specifically, we get an mAP50 of 0.089 and an mAP25-75 of 0.089.


Looking at the class distributions, the picture becomes clearer. We see a case similar to what was observed with YOLO: the model was able to predict almost only the majority class, and the results are worse than those of the previous models, though not far from YOLO's. Comparing YOLO and R-CNN predictions visually, those of R-CNN are more accurate. This is likely due to the RPN and the fact that we can tune the anchor sizes to optimize the bounding-box locations.

Conclusions¶

The following graph presents a comparison between the three models based on the test-set mAP50 and the number of parameters:

In [25]:
import matplotlib.pyplot as plt
accs = [32, 9.34, 8.9]
names = ['DETR', 'YOLOv8', 'Faster-RCNN']
params = [41.3, 43.7, 43]

for a,p,n in zip(accs, params, names):
    plt.scatter(p, a, s=p*5, label=n, alpha=0.5)
plt.title("Accuracy TACO object detection")
plt.xlabel('number of parameters (M)')
plt.ylabel("mAP50")
plt.legend()
plt.show()

Comparing the accuracy of the three models together with their number of parameters, it is clear that DETR is much better than YOLO and Faster R-CNN. All three models have a similar number of parameters, but the accuracy of DETR (which even has slightly fewer parameters than the other two) is much higher. Looking at the sample predictions, we can also see that the number of bounding boxes, their locations, and the label confidences are better for DETR than for YOLO and R-CNN. We further observed that this was achieved with minimal training of DETR, which points to the power of attention and the set-matching loss. To get even better results with DETR we suggest the following:


1. Further investigation of focal loss or other alternatives to cross entropy.
2. Using a transformer backbone (such as a Swin transformer).

Both directions require time and resources that we didn't have during this project.


Another interesting observation is the training time: training DETR and R-CNN took ~5-6 minutes per epoch, while training YOLO took ~2.5-3. We didn't take accurate measurements, but it is clear that YOLO is much faster than DETR and R-CNN. The fact that YOLO is faster than R-CNN is known and not surprising (one-stage vs. two-stage model), but the gap with DETR is more surprising, since DETR has the advantage of parallelism through the transformer (it is not autoregressive and can process all queries in parallel). This is another interesting topic we didn't have time to investigate.
**To conclude -** we compared the fine-tuning of three well-known models on the TACO dataset. Each model has a different architecture, and we found that each has its own pros and cons. In terms of mAP accuracy, DETR, a transformer-based model, is preferable to the other two. We really enjoyed the project and wish we had more time/resources to deepen the investigation.

Theoretical Questions¶

  • This is the theoretical part of the final project. It includes theoretical questions from various topics covered in the course.
  • There are 7 questions among which you need to choose 6, according to the following key:
    • Question 1 is mandatory.
    • Choose one question from questions 2-3.
    • Question 4 is mandatory.
    • Questions 5-6 are mandatory.
    • Question 7 is mandatory.
  • Question 1 is worth 15 points, whereas each of the other questions is worth 7 points.
  • All in all, the maximal grade for this part is 15+7*5=50 points.
  • You should answer the questions on your own. We will check for plagiarism.
  • If you need to add external images (such as graphs) to this notebook, please put them inside the 'imgs' folder. DO NOT put a reference to an external link.
  • Good luck!

Part 1: General understanding of the course material¶

Question 1¶

  1. Relate the number of parameters in a neural network to the over-fitting phenomenon (*). Relate this to the design of convolutional neural networks, and explain why CNNs are a plausible choice for a hypothesis class for visual classification tasks.

    (*) In the context of classical under-fitting/over-fitting in machine learning models.

  1. Consider the linear classifier model with hand-crafted features: $$f_{w,b}(x) = w^T \psi(x) + b$$ where $x \in \mathbb{R}^2$, $\psi$ is a non-learnable feature extractor and assume that the classification is done by $sign(f_{w,b}(x))$. Let $\psi$ be the following feature extractor $\psi(x)=x^TQx$ where $Q \in \mathbb{R}^{2 \times 2}$ is a non-learnable positive definite matrix. Describe a distribution of the data which the model is able to approximate, but the simple linear model fails to approximate (hint: first, try to describe the decision boundary of the above classifier).
  1. Assume that we would like to train a Neural Network for classifying images into $C$ classes. Assume that the architecture can be stored in the memory as a computational graph with $N$ nodes where the output is the logits (namely, before applying softmax) for the current batch ($f_w: B \times Ch \times H \times W \rightarrow B \times C$). Assume that the computational graph operates on tensor values.
    • Implement the CE loss assuming that the labels $y$ are hard labels given in a LongTensor (as usual). Use Torch's log_softmax and index_select functions and implement with as few operations as possible.
In [1]:
from torch.nn.functional import log_softmax
from torch import gather
import torch

def ce_loss(model, x, y):
    """CE loss for hard labels y (a LongTensor of shape (B,)).
    Input: model, x, y. Output: the loss on the current batch.
    gather plays the role of index_select here."""
    logits = model(x)                             # (B, C)
    log_probs = log_softmax(logits, dim=1)        # (B, C)
    picked = gather(log_probs, 1, y.view(-1, 1))  # log-prob of the true class
    return -torch.mean(picked)
  • Using the model's function as a black box, draw the computational graph (treating both log_softmax and index_select as atomic operations). How many nodes are there in the computational graph?
  • Now, instead of using hard labels, assume that the labels represent some probability distribution over the $C$ classes. How would the gradient computation be affected? Analyze the growth in the computational graph, memory, and computation.
  • Apply the same analysis in the case that we would like to double the batch size. How should we change the learning rate of the optimizer?
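For reference when analyzing the soft-label variant, the same loss with distribution-valued labels can be sketched as follows (a sketch; `ce_loss_soft` is a hypothetical helper name, and the idea is that the index_select/gather is replaced by an elementwise product with the label distribution):

```python
import torch
from torch.nn.functional import log_softmax

def ce_loss_soft(logits, y_soft):
    """CE with soft labels: y_soft has shape (B, C), rows summing to 1."""
    log_probs = log_softmax(logits, dim=1)
    # Expected negative log-likelihood under the label distribution.
    return -(y_soft * log_probs).sum(dim=1).mean()
```

With one-hot rows in `y_soft`, this reduces to the hard-label cross-entropy above.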

Part 2: Optimization & Automatic Differentiation¶

Question 2: resolving gradient conflicts in multi-task learning¶

Assume that you want to train a model to perform two tasks: task 1 and task 2. For each such task $i$ you have an already implemented function loss_i = forward_and_compute_loss_i(model,inputs) such that given the model and the inputs it computes the loss w.r.t task $i$ (assume that the computational graph is properly constructed). We would like to train our model using SGD to succeed in both tasks as follows: in each training iteration (batch) -

  • Let $g_i$ be the gradient w.r.t the $i$-th task.
  • If $g_1 \cdot g_2 < 0$:
    • Pick a task $i$ at random.
    • Apply GD w.r.t only that task.
  • Otherwise:
    • Apply GD w.r.t both tasks (namely $\mathcal{L}_1 + \mathcal{L}_2$).

Note that in the above formulation the gradient is thought of as a concatenation of the gradients w.r.t. all the model's parameters, and $g_1 \cdot g_2$ stands for a dot product.

What parts should be modified to implement the above? Is it the optimizer, the training loop, or both? Implement the above algorithm in one or more code cells below.
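The procedure above can be sketched as a single training step (a non-authoritative sketch; `conflict_aware_step` is a hypothetical name, and `forward_and_compute_loss_i` stand for the black-box functions from the question):

```python
import torch

def conflict_aware_step(model, optimizer, inputs,
                        forward_and_compute_loss_1,
                        forward_and_compute_loss_2):
    """One SGD step with the gradient-conflict rule described above."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Compute the per-task gradients as flat vectors, one forward/backward per task.
    g = []
    for loss_fn in (forward_and_compute_loss_1, forward_and_compute_loss_2):
        loss = loss_fn(model, inputs)
        grads = torch.autograd.grad(loss, params)
        g.append(torch.cat([gr.reshape(-1) for gr in grads]))

    if torch.dot(g[0], g[1]) < 0:
        chosen = g[torch.randint(2, (1,)).item()]  # conflict: pick one task at random
    else:
        chosen = g[0] + g[1]                       # no conflict: gradient of L1 + L2

    # Write the chosen gradient into .grad and let the (unmodified) optimizer step.
    optimizer.zero_grad()
    offset = 0
    for p in params:
        n = p.numel()
        p.grad = chosen[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
```

In this sketch only the training loop changes; a standard SGD optimizer is reused as-is.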

Question 3: manual automatic differentiation¶

Consider the following two-input two-output function: $$ f(x,y) = (x^2\sin(xy+\frac{\pi}{2}), x^2\ln(1+xy)) $$

  • Draw a computational graph for the above function. Assume that the unary atomic units are squaring, taking square root, $\exp,\ln$, basic trigonometric functions and the binary atomic units are addition and multiplication. You would have to use constant nodes.
  • Calculate manually the forward pass.
  • Calculate manually the derivative of all outputs w.r.t all inputs using a forward mode AD.
  • Calculate manually the derivative of all outputs w.r.t all inputs using a backward mode AD.
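A quick numerical check can be used to verify the manual forward-mode and backward-mode results (a sketch with hypothetical helper names; central differences approximate the true Jacobian, they do not compute it symbolically):

```python
import math

def f(x, y):
    """The two-output function from the question."""
    return (x ** 2 * math.sin(x * y + math.pi / 2),
            x ** 2 * math.log(1 + x * y))

def numerical_jacobian(x, y, h=1e-5):
    """Central-difference Jacobian of f at (x, y): J[i][j] = d f_i / d input_j."""
    fxp, fxm = f(x + h, y), f(x - h, y)
    fyp, fym = f(x, y + h), f(x, y - h)
    return [[(fxp[i] - fxm[i]) / (2 * h), (fyp[i] - fym[i]) / (2 * h)]
            for i in range(2)]
```

For example, at $(x,y)=(1,0)$ we get $f=(1,0)$ and `numerical_jacobian(1.0, 0.0)` is approximately $[[2,0],[0,1]]$, which the manual AD passes should reproduce.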

Part 3: Sequential Models¶

Question 4: RNNs vs Transformers in the real life¶

In each of the following scenarios, decide whether to use an RNN-based model or a transformer-based model. Justify your choice.

  1. You are running a start-up in the area of automatic summarization of academic papers. The inference of the model is done on the server side, and it is very important for it to be fast.
  2. You need to design a mobile application that gathers a small amount of data from a few apps every second, and then uses a NN to possibly generate an alert given the information from the current second and from the past minute.
  3. You have a prediction task over fixed length sequences on which you know the following properties:
    • In each sequence there are only a few tokens that the model should attend to.
    • Most of the information needed for generating a reliable prediction is located at the beginning of the sequence.
    • There is no restriction on the computational resources.

Part 4: Generative modeling¶

Question 5: VAEs and GANS¶

Suggest a method for combining VAEs and GANs. Focus on the different components of the model and how to train them jointly (the objectives). Which drawbacks of these models may the combined model overcome? Which may it not?

Question 6: Diffusion Models¶

Show that $q(x_{t-1}|x_t,x_0)$ is tractable and is given by $\mathcal{N}(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta}_t I)$ where the terms for $\tilde{\mu}(x_t,x_0)$ and $\tilde{\beta}_t$ are given in the last tutorial. Do so by explicitly computing the PDF.

Part 5: Training Methods¶

Question 7: Batch Normalization and Dropout¶

For both BatchNorm and Dropout analyze the following:

  1. How to use them during the training phase (both in forward pass and backward pass)?
  2. How do they behave differently in the inference phase? How are these operation modes distinguished in code?
  3. Assume you would like to perform multi-GPU training (*) to train your model. What should be done in order for BatchNorm and Dropout to work properly? Assume that each process holds its own copy of the model and that the processes can share information with each other.

(*): In multi-GPU training, each GPU is associated with its own process that holds an independent copy of the model. In each training iteration a (large) batch is split among these processes (GPUs), which compute the gradients of the loss w.r.t. the relevant split of the data. The gradients from each process are then shared and averaged so that the GD step takes the correct gradient into account and the model copies stay synchronized. Note that the processes are blocked between training iterations.
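As a minimal illustration of the train/inference mode switch asked about in part 2 (a sketch only; the commented-out SyncBatchNorm line assumes a multi-process distributed setup):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.BatchNorm1d(8),  # train: batch statistics; eval: running statistics
    torch.nn.Dropout(p=0.5),  # train: randomly zeroes units; eval: identity
    torch.nn.Linear(8, 2),
)

model.train()   # training mode (the default after construction)
assert model.training

model.eval()    # inference mode: deterministic forward pass
assert not model.training

# For multi-GPU (e.g., DistributedDataParallel) training, BatchNorm statistics
# can be synchronized across processes:
# model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```

The `training` flag (toggled by `train()`/`eval()`) is what distinguishes the two operation modes in code.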